IN THE CLAIMS:

Kindly replace the claims of record with the following full set of claims:

1. (Previously presented) An audio-visual system for processing video data
comprising: 

an object detection module capable of providing a plurality of object features 
from the video data; 

an audio processor module capable of providing a plurality of audio features from 
the video data; 

a processor coupled to the object detection and the audio segmentation modules,
arranged to determine a maximum correlation value among a plurality of correlation 
values between the plurality of object features and the plurality of audio features, wherein 
said correlation values are determined as the sum of elements in a subset of said audio
features selected from the group consisting of: two or more of the following: average 
energy, pitch, zero crossing, bandwidth, band central, roll off, low ratio, spectral flux and
12 MFCC components, and selected object features. 

2. (Original) The system of claim 1, wherein the processor is further arranged to
determine whether an animated object in the video data is associated with audio. 

3. (Cancelled) 

4. (Original) The system of claim 2, wherein the animated object is a face and the
processor is arranged to determine whether the face is speaking.

5. (Previously presented) The system of claim 4, wherein the plurality of object features
are eigenfaces that represent global features of the face. 

6. (Previously presented) The system of claim 1, further comprising:

a latent semantic indexing module coupled to the processor and that preprocesses 
the plurality of object features and the plurality of audio features before the correlation is 
performed. 

7. (Original) The system of claim 6, wherein the latent semantic indexing module
includes a singular value decomposition module. 

8. (Previously presented) A method for identifying a speaking person within video data,
the method comprising the steps of: 

receiving video data including image and audio information; 

determining a plurality of face image features from one or more faces in the video
data;

determining a plurality of audio features related to audio information; 

calculating correlation values between the plurality of face image features and the 
audio features, wherein said correlation values are determined as the sum of elements in a 
subset of said audio features selected from the group consisting of two or more of the
following: average energy, pitch, zero crossing, bandwidth, band central, roll off, low 
ratio, spectral flux and 12 MFCC components, and said face image features; and

determining the speaking person based on a maximum of the correlation values.

9. (Previously presented) The method according to claim 8, further comprising the step 
of: 

normalizing the face image features and the audio features. 

10. (Previously presented) The method according to claim 9, further comprising the step
of: 

performing a singular value decomposition on the normalized face image features 
and the audio features. 

11. (Original) The method according to claim 8, wherein the determining step includes 
determining the speaking person based upon the one or more faces that has the largest
correlation.

12. (Original) The method according to claim 10, wherein the calculating step includes
forming a matrix of the face image features and the audio features.

13. (Previously presented) The method according to claim 12, further comprising the step
of: 

performing an optimal approximate fit using smaller matrices as compared to full
rank matrices formed by the face image features and the audio features.

14. (Original) The method according to claim 13, wherein the rank of the smaller
matrices is chosen to remove noise and unrelated information from the full rank matrices.

15. (Previously presented) A memory medium including code for processing a video 
including images and audio, the code comprising: 

code to obtain a plurality of object features from the video; 

code to obtain a plurality of audio features from the video; 

code to determine correlation values between the plurality of object features and 
the plurality of audio features, wherein said correlation values are determined as the sum 
of elements in a subset of said audio features selected from the group consisting of two or
more of the following: average energy, pitch, zero crossing, bandwidth, band central, roll
off, low ratio, spectral flux and 12 MFCC components, and selected object features; and

code to determine an association between one or more objects in the video and the 
audio based on a maximum of the correlation values. 

16. (Original) The memory medium of claim 15, wherein the one or more objects
comprises one or more faces. 

17. (Original) The memory medium of claim 16, further comprising code to determine a
speaking face. 

18. (Previously presented) The memory medium of claim 15, further comprising:

code to create a matrix using the plurality of object features and the audio features
and code to perform a singular value decomposition on the matrix.

19. (Previously presented) The memory medium of claim 18, further comprising:

code to perform an optimal approximate fit using smaller matrices as compared
to full rank matrices formed by the object features and the audio features.

20. (Original) The memory medium according to claim 19, wherein the rank of the
smaller matrices is chosen to remove noise and unrelated information from the full rank
matrices. 
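
By way of illustration only, and without limiting the claims, the matrix formation, normalization, singular value decomposition, reduced-rank fit and maximum-correlation steps recited above might be sketched in Python as follows. The function names (`normalize`, `low_rank`, `pick_speaker`), the use of NumPy, the per-frame feature layout, and the scoring of each face by summing absolute cross-correlation elements between its features and the audio features are all assumptions made for this sketch; they are not taken from the claims or the specification.

```python
import numpy as np

def normalize(X):
    """Give every column zero mean and unit variance (constant columns become zero)."""
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return X / std

def low_rank(X, k):
    """Best rank-k least-squares approximation of X, via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def pick_speaker(faces, audio, k=2):
    """Return (index of the face with the largest score, list of scores).

    faces: list of arrays, each (n_frames, n_face_features)  -- hypothetical layout
    audio: array (n_frames, n_audio_features)                -- hypothetical layout
    k:     rank kept after the SVD; an illustrative choice, not a claimed value
    """
    # Form one matrix from all face features and the audio features.
    X = normalize(np.hstack(list(faces) + [audio]))
    # Reduced-rank fit: discard weak components (noise, unrelated information).
    Xk = low_rank(X, k)
    # Correlation-like matrix of the reduced data.
    C = (Xk.T @ Xk) / X.shape[0]
    n_audio = audio.shape[1]
    audio_cols = slice(C.shape[1] - n_audio, C.shape[1])
    scores, start = [], 0
    for f in faces:
        stop = start + f.shape[1]
        # Assumed score: sum of absolute face-to-audio correlation elements.
        scores.append(float(np.abs(C[start:stop, audio_cols]).sum()))
        start = stop
    return int(np.argmax(scores)), scores
```

In this sketch the speaking face is simply the one whose score is the maximum; the choice of `k` would in practice depend on the data.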